Internet Archive Metadata

Metadata is data about data. In the case of Internet Archive items, the metadata describes the contents of the items. Metadata can include information such as the performance date for a concert, the name of the artist, and a set list for the event.

Metadata is a very important element of items in the Internet Archive. Metadata allows people to locate and view information. Items with little or poor metadata may never be seen and can become lost.

Note that metadata keys must be valid XML tags. Please refer to the XML Naming Rules section here.
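
As a rough illustration of the "valid XML tag" requirement, the sketch below asks Python's standard XML parser whether a candidate key can stand alone as an element name. The function name `is_valid_metadata_key` is hypothetical, and the check only covers XML well-formedness; archive.org may apply further restrictions that are not modeled here.

```python
import xml.etree.ElementTree as ET


def is_valid_metadata_key(key: str) -> bool:
    """Return True if `key` can be used unchanged as an XML element name."""
    try:
        element = ET.fromstring(f"<{key}/>")
    except ET.ParseError:
        return False
    # Reject input that only parsed because it carried extra markup,
    # e.g. "title lang='en'" parses, but the resulting tag is just "title".
    return element.tag == key


for candidate in ["title", "external-identifier", "my field", "2ndcopy"]:
    print(candidate, is_valid_metadata_key(candidate))
```

Keys containing spaces or starting with a digit are rejected because the parser cannot read them as a single element name.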

All metadata for archive.org items is stored in <identifier>_meta.xml and <identifier>_files.xml. The meta.xml file contains all of the item-level metadata for an item (e.g. title, description, creator, etc.). The files.xml file contains all of the file-level metadata (e.g. track title, checksums, etc.). While these two files are the canonical sources of metadata for archive.org items, most users will interact with an item’s metadata via the metadata API. For example, nasa_meta.xml corresponds to /metadata/nasa/metadata and nasa_files.xml to /metadata/nasa/files.
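
The mapping between the two XML files and the metadata API paths can be sketched as a small helper. The `https://archive.org` base URL and the function name `metadata_api_url` are assumptions made for this example; only the `/metadata/<identifier>/metadata` and `/metadata/<identifier>/files` paths come from the text above.

```python
# Assumption: the metadata API is served from this host.
BASE_URL = "https://archive.org"


def metadata_api_url(identifier: str, section: str = "") -> str:
    """Build a metadata API URL for an item, optionally narrowed to a section."""
    path = f"/metadata/{identifier}"
    if section:
        path += f"/{section}"
    return BASE_URL + path


# Item-level metadata, the counterpart of nasa_meta.xml
print(metadata_api_url("nasa", "metadata"))
# File-level metadata, the counterpart of nasa_files.xml
print(metadata_api_url("nasa", "files"))
```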

This document describes common metadata fields used on archive.org. Refer to the metadata API docs for more details on reading and writing metadata to items.

Archive.org Identifiers

Each item at Internet Archive has an identifier. An identifier is composed of a unique combination of alphanumeric characters (limited to ASCII), underscores (_), dashes (-), or periods (.). The first character of an identifier must be alphanumeric (e.g. it cannot start out with an underscore, dash, or period). The maximum length of an identifier is 100 characters, but we generally recommend that identifiers be between 5 and 80 characters in length.

Identifiers must be unique across the entirety of Internet Archive, not simply unique within a single collection.

Once defined, an identifier cannot be changed. It travels with the item or object and is involved in every manner of accessing or referring to the item.
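
The identifier rules can be checked before upload with a short validator. This is a minimal sketch, not the check archive.org itself runs: it combines the character and length rules in this section with the 5-character minimum listed under the identifier field in the schema, it does not cover account items (which begin with @), and it cannot test global uniqueness. The name `check_identifier` is hypothetical.

```python
import re
from typing import Optional, Tuple

# First character must be an ASCII letter or digit; the remaining 4 to 99
# characters may also be underscore, period, or dash (5 to 100 in total).
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{4,99}")


def check_identifier(identifier: str) -> Tuple[bool, Optional[str]]:
    """Return (is_valid, note) for a proposed identifier."""
    if not IDENTIFIER_RE.fullmatch(identifier):
        return False, ("use 5-100 ASCII letters, digits, '_', '-' or '.', "
                       "starting with a letter or digit")
    if len(identifier) > 80:
        return True, "valid, but longer than the recommended 80 characters"
    return True, None


print(check_identifier("SanFrancisco1955CinemascopeFilm"))
print(check_identifier("_starts_with_underscore"))
```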

Custom Metadata Fields

Internet Archive strives to be metadata agnostic, enabling users to define the metadata format which best suits the needs of their material. In addition to the standard metadata fields listed below, you may also define as many custom metadata fields as you require. These metadata fields can be defined ad hoc at item creation or metadata editing time and do not have to be defined in advance.

Metadata Schema

Below are _meta.xml fields that have special meaning on archive.org.

identifier

  • internal use only: No
  • usage notes: We encourage the use of human-readable identifiers, rather than opaque strings of numbers or letters. For most projects we try to keep identifiers below 80 characters in length for the sake of readability.
  • definition: Unique identifier for an item on the archive.org web site. Used in the URL for the item, i.e. archive.org/details/[identifier].
  • required: Yes
  • label: Item Identifier
  • repeatable: No
  • accepted values: String, minimum length is 5 characters, maximum length is 100 characters, contains only Roman alphabet characters, numbers, periods (.), underscores ( _ ), or dashes ( - ), and first character must be alphanumeric. mediatype:account items begin with @ symbol.
  • edit access: IA admin
  • defined by: uploader
  • example: SanFrancisco1955CinemascopeFilm

mediatype

  • internal use only: No
  • usage notes:
      - texts: books, articles, newspapers, magazines, any documents with content that contains text
      - etree: live music concerts; items should only be uploaded for artists with collections in the etree “Live Music Archive” community
      - audio: any item where the main media content is audio files, like FLAC, mp3, WAV, etc.
      - movies: any item where the main media content is video files, like mpeg, mov, avi, etc.
      - software: any item where the main media content is software intended to be run on a computer or related device such as gaming devices, phones, etc.
      - image: any item where the main media content is image files (but is not a book or other text item), like jpeg, gif, tiff, etc.
      - data: any item where the main content is not media or web pages, such as data sets
      - web: any item where the main content is copies of web pages, usually stored in WARC or ARC format
      - collection: designates the item as a collection that can “contain” other items
      - account: designates the item as being a user account page; can only be set by internal archive systems
  • definition: Mediatype tells us about the main content of the item. It is used to determine how the item is displayed on the web site and may trigger special processing depending on the types of files contained in the item.
  • required: Yes
  • label: Type of Media
  • repeatable: No
  • accepted values: texts, etree, audio, movies, software, image, data, web, collection, account
  • edit access: IA admin
  • defined by: uploader
  • example: movies

title

  • internal use only: No
  • usage notes: All alphabets are supported
  • definition: Title of media
  • required: Recommended
  • label: Title
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: San Francisco (1955 Cinemascope film)

collection

  • internal use only: No
  • usage notes: Required for all items except “fav-username” collections. Always list the item’s primary collection first in meta.xml; this is the collection the item “belongs” to. The primary collection often represents the entity that contributed or created the content. Uploaders can only choose from collections that they have privileges for. General uploaders with no special privileges can only upload to selected “Community” collections or the test_collection. Items in the test_collection are removed from the site after 30 days.
  • definition: Indicates to the website what collection(s) this item belongs to.
  • required: No
  • label: Collections
  • repeatable: Yes
  • accepted values: Must be a valid identifier
  • edit access: user admin
  • defined by: uploader
  • example: prelinger
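
Because collection is repeatable and the first value is the primary collection, order matters when reading meta.xml. The sketch below reads the values in document order. The sample XML is hypothetical: the `<metadata>` root element and the second collection name are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical meta.xml content used only for illustration.
SAMPLE_META_XML = """<metadata>
  <identifier>SanFrancisco1955CinemascopeFilm</identifier>
  <mediatype>movies</mediatype>
  <collection>prelinger</collection>
  <collection>example_secondary_collection</collection>
</metadata>"""


def collections(meta_xml: str) -> list:
    """Return every collection value in the order it appears in the XML."""
    root = ET.fromstring(meta_xml)
    return [element.text for element in root.findall("collection")]


all_collections = collections(SAMPLE_META_XML)
# The first listed collection is the one the item "belongs" to.
primary = all_collections[0] if all_collections else None
print(primary, all_collections)
```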

description

  • internal use only: No
  • usage notes: May be about the media content (e.g. a description of the book’s plot), the physical item it represents (e.g. missing or damaged pages in the physical book that was digitized), the creator of the media (e.g. author biographical info that relates to the book), or any other information that may help a user understand the item or its context. All alphabets are supported.
  • definition: Describes the media stored in the item.
  • required: Recommended
  • label: Item Description
  • repeatable: Yes
  • accepted values: String, can contain links, formatting and images in html/css
  • edit access: uploader
  • defined by: uploader
  • example: Cinemascope homage to the city of San Francisco made by amateur filmmaker and inventor Tullio Pellegrini.

uploader

  • internal use only: No
  • usage notes: The uploader field determines which account has full access to modify/edit/delete metadata and files from the item without having any special privileges granted. Any other account that wants to modify this item must have some level of administrative privilege granted by Internet Archive.
  • definition: Email address of the account that uploaded the item to archive.org.
  • required: Yes
  • label: Item Uploader
  • repeatable: No
  • accepted values: Email address
  • edit access: IA admin
  • defined by: uploader
  • example: footage@panix.com

subject

  • internal use only: No
  • usage notes: Books and other media objects from libraries often use Library of Congress Subject Headings, http://id.loc.gov/authorities/subjects.html. Some collections may use their own controlled vocabulary for setting subjects. Many other items use the subject field as more casual “tags.” All alphabets are supported.
  • definition: Subjects and/or topics covered by the media content
  • required: No
  • label: Subject/Keyword
  • repeatable: Yes
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: France

date

  • internal use only: No
  • usage notes: We encourage people to use YYYY, YYYY-MM, or YYYY-MM-DD for this field, but sometimes exact dates are not possible to determine. Other common usages: [YYYY] (brackets) when a date is not certain; c.a. YYYY (c.a.) when a date is approximate; and [n.d.] when a date is unknown (you may also leave the field blank in this case). Books, movies, and CDs often only have YYYY for a publication date. Magazines often have YYYY-MM for a publication date. Concerts and articles often have YYYY-MM-DD publication dates. Use the most specific verifiable date you have access to. When the item is a digital representation of a physical piece of media (e.g. a book, a 78rpm disc, etc.) the publication date should represent the date that the specific physical item was published. A book may have been written in 1850, and then an edition was republished in 1885. If the digitized version is the edition republished in 1885, use 1885 as the publication date (not 1850).
  • definition: Date of publication
  • required: No
  • label: Publication Date
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: 1965, 2013-05-25, [n.d.]
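
The date conventions in the usage notes can be told apart with a few patterns. This is a minimal sketch with a hypothetical function name and hypothetical labels; the field accepts any string, so anything unrecognized is reported as "other" rather than rejected, and the patterns check only the shape of the value (a month of 13 would still pass).

```python
import re

# Each label is paired with a pattern for one convention from the usage notes.
DATE_PATTERNS = [
    ("exact", re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")),  # YYYY, YYYY-MM, YYYY-MM-DD
    ("uncertain", re.compile(r"\[\d{4}\]")),            # [YYYY]
    ("approximate", re.compile(r"c\.a\. \d{4}")),       # c.a. YYYY
    ("unknown", re.compile(r"\[n\.d\.\]")),             # [n.d.]
]


def classify_date(value: str) -> str:
    """Label a date string by the convention it follows."""
    text = value.strip()
    if not text:
        return "unknown"  # a blank field is also allowed for unknown dates
    for label, pattern in DATE_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "other"


for sample in ["1965", "2013-05-25", "[n.d.]", "[1850]", "c.a. 1885"]:
    print(sample, classify_date(sample))
```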

contributor

  • internal use only: No
  • usage notes: For physical items that have been digitized, contributor represents the library or other organization that owns the physical item. For born-digital media, contributor often represents the organization responsible for the distribution of the content (e.g. a radio station or television station).
  • definition: The person or organization that provided the physical or digital media.
  • required: No
  • label: Contributor
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: Robarts - University of Toronto

creator

  • internal use only: No
  • usage notes: For items provided by libraries, the creator is often listed using the Library of Congress Name Authority Headings, http://authorities.loc.gov/. For items from other sources, the creator is often listed as first name and surname. When an item was created by an organization, such as a government agency or a production company, use the full name of the organization. This field represents the entity that created the media, not the person who uploaded the media to archive.org (though these may be the same person). All alphabets are supported.
  • definition: The individual(s) or organization that created the media content.
  • required: No
  • label: Creator
  • repeatable: Yes
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: Austen, Jane, 1775-1817; Ralph Burns

language

  • internal use only: No
  • usage notes: For items provided by libraries, the language is often provided as a 3 letter MARC language code (e.g. eng), https://www.loc.gov/marc/languages/. For other items, the language is often written out as the full name (e.g. English). Not all items have a language associated with them (e.g. instrumental music), but when there is written or spoken language we are able to do some sorts of processing better when we know the language. When an item contains no OCRable content, you will sometimes see the language set to zxx. Language is particularly important for text items so that we can do the best job with optical character recognition processing.
  • definition: The language the media is written or recorded in.
  • required: No
  • label: Language
  • repeatable: Yes
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: eng, Italian

addeddate

  • internal use only: No
  • usage notes: Deprecated. YYYY-MM-DD HH:MM:SS. Addeddate is automatically set when the item directory is created in our file system. In many cases, the addeddate will be very similar to the publicdate. However, in some cases we create an item directory with metadata but no media files prior to the media being scanned. The addeddate reflects when the item was created, regardless of when the media was added to the item. When the media is added at a later date and a derive.php task is run, the publicdate will be added to the item.
  • definition: Date and time in UTC that the item was created on archive.org
  • required: Deprecated
  • label: Date Added
  • repeatable: No
  • accepted values: String
  • edit access: not editable
  • defined by: IA software
  • example: 2017-03-28 22:05:46

publicdate

  • internal use only: No
  • usage notes: Publicdate is automatically set when the archive.php task finishes.
  • definition: The date and time in UTC that the item was made public on archive.org.
  • required: Yes
  • label: Date Archived
  • repeatable: No
  • accepted values: YYYY-MM-DD HH:MM:SS, YYYY-MM-DD
  • edit access: not editable
  • defined by: IA software
  • example: 2011-12-25 19:01:43

scandate

  • internal use only: No
  • usage notes: When a physical item is scanned/digitized, scandate represents the date/time that the digitization occurred. For web items, scandate represents the date/time the first WARC file in the item was created. For TV and radio items, scandate represents the beginning time of the recording. Formats: YYYYMMDDHHMMSS, YYYYMMDD, YYYY, YYYY-MM-DD HH:MM:SS.
  • definition: The date and time in UTC that the media was captured.
  • required: No
  • label: Scan Date
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: 20170329201345
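
Since scandate can arrive in several formats, a reader may want to normalize it before comparing values. The sketch below picks a format by string length and attaches UTC, as the definition states the value is in UTC. The function name `parse_scandate` is hypothetical, and values in any other format simply return None.

```python
from datetime import datetime, timezone
from typing import Optional

# Each listed format has a distinct length, so length selects the format.
SCANDATE_FORMATS = {
    14: "%Y%m%d%H%M%S",       # YYYYMMDDHHMMSS
    8: "%Y%m%d",              # YYYYMMDD
    4: "%Y",                  # YYYY
    19: "%Y-%m-%d %H:%M:%S",  # YYYY-MM-DD HH:MM:SS
}


def parse_scandate(value: str) -> Optional[datetime]:
    """Parse a scandate string into a UTC datetime, or return None."""
    fmt = SCANDATE_FORMATS.get(len(value))
    if fmt is None:
        return None
    try:
        return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None


print(parse_scandate("20170329201345"))
```

Missing parts default to the earliest value, so a bare year parses as January 1 at midnight of that year.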

imagecount

  • internal use only: No
  • usage notes:
      - Texts: represents number of page images in the item
      - TV: represents number of seconds of video in the item
      - Web: represents number of URIs captured in the WARCs in the item
      - CD: number of images of physical item and accompanying materials
  • definition: Imagecount gives an indication of the size of the content of an item (outside of file size, which is represented in the size field). Originally used only for books, the field has been repurposed over time to provide similar information for other mediatypes.
  • required: No
  • label: Image Count
  • repeatable: No
  • accepted values: Positive whole number
  • edit access: IA software
  • defined by: IA software
  • example: 230

year

  • internal use only: No
  • usage notes: Deprecated, use date field
  • definition: Deprecated, use date field
  • required: Deprecated
  • label: Year of Publication
  • repeatable: No
  • accepted values: YYYY
  • edit access: uploader
  • defined by: uploader
  • example: 1996

scanner

  • internal use only: No
  • usage notes: Primarily an internally used field. For digitized texts this represents the individual digitization station (e.g. Scribe 2 in the New Jersey center). For web items this represents the crawl machine used to gather the data. For films this represents the film scanner. For CDs this represents the version of the scanning software used for that CD. For end-user contributed items, this represents the software used to upload the item.
  • definition: Machinery used to digitize or collect the media
  • required: No
  • label: Scanner
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: scribe2.nj.archive.org, selenium-101.us.archive.org, Lasergraphics Scanstation, ArchiveCD Version 2.1.15, Internet Archive HTML5 Uploader 1.6.3

source

  • internal use only: No
  • usage notes: Used to signify where a piece of media originated, or what the physical media was prior to digitization.
      - Focused crawl items list the site being crawled in this field.
      - Texts digitization centers use the field to denote folios.
      - TV uses the field to indicate the signal source.
      - Internal audio digitization projects use this field to indicate the format of the original media (CD, LP, 78, etc.).
      - External users often use this field to list a URL where the media content originated.
      - Etree users use it to record the “path” for recording a live concert.
  • definition: Source of media
  • required: No
  • label: Source
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: folio, Comcast Cable, CD, DPA 4021 > SX-M2 > SD 744T @ 44.1 kHZ/16 bit

repub_state

  • internal use only: No
  • definition: Indicates the current state of a scanned book.
  • required: No
  • label: Repub State
  • repeatable: No
  • accepted values: Whole number
  • edit access: IA software
  • defined by: IA software
  • example: 19

access-restricted

  • internal use only: Yes
  • usage notes: This tag is only used on items of mediatype collection (it will have no effect on items of any other type). This tag should only be assigned by internal IA admins.
  • definition: Collection contents are restricted access
  • required: No
  • label: Access Restricted
  • repeatable: No
  • accepted values: true
  • edit access: user admin
  • defined by: user admin
  • example: true

public-format

  • internal use only: Yes
  • usage notes: This tag only affects items of mediatype collection and must be used in conjunction with the access-restricted tag. This tag should only be assigned by internal IA admins.
  • definition: Collection file formats that are available to users in an Access Restricted collection
  • required: No
  • label: Public Formats
  • repeatable: Yes
  • accepted values: String
  • edit access: user admin
  • defined by: user admin
  • example: Metadata

access-restricted-item

  • definition: Identifies item that is access-restricted
  • usage notes: Only used on items, not collections. Automatically added to items in an access-restricted collection at the end of any task.
  • required: No
  • label: Access Restricted Item
  • repeatable: No
  • accepted values: true
  • example: true

scanningcenter

  • internal use only: No
  • usage notes: Generally used in conjunction with our scanning services, this tag gives the location where an item was digitized, scanned or captured.
  • definition: The location where a digital copy of the media item was created
  • required: No
  • label: Scanning Center
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: boston

ocr

  • internal use only: No
  • usage notes: Set during derivation process.
  • definition: Software package and version used for optical character recognition
  • required: No
  • label: OCR Software
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: ABBYY FineReader 8.0

noindex

  • internal use only: Yes
  • usage notes: While the accepted practice is to have a value of “true” for this tag, the mere presence of the tag in meta.xml will actually cause the same effect regardless of the value used (including empty). In addition to not being included in the public archive.org search engine, the noindex tag will also cause the item to not be listed in the sitemap.
  • definition: Prevents item from being indexed in public archive.org search engine
  • required: No
  • label: No Index
  • repeatable: No
  • accepted values: true
  • edit access: uploader
  • defined by: uploader
  • example: true
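
The presence-only behavior described in the usage notes can be sketched as follows. The function name `is_noindex` and the `<metadata>` root element in the sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_noindex(meta_xml: str) -> bool:
    """Return True if a noindex element is present, whatever its value."""
    # Presence alone matters: <noindex>true</noindex>, <noindex>false</noindex>
    # and an empty <noindex/> all have the same effect.
    return ET.fromstring(meta_xml).find("noindex") is not None


print(is_noindex("<metadata><noindex>true</noindex></metadata>"))
print(is_noindex("<metadata><noindex/></metadata>"))
print(is_noindex("<metadata><title>Example</title></metadata>"))
```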

ppi

  • internal use only: No
  • usage notes: Indicates pixels per inch for an image. The most common use case is Internet Archive digitization centers. This number is set during the book scanning process.
  • definition: Pixels per inch
  • required: No
  • label: PPI
  • repeatable: No
  • accepted values: Positive whole number
  • edit access: uploader
  • defined by: uploader
  • example: 300

curation

  • internal use only: Yes
  • usage notes: Curation is a compound field with “sub-fields”: curator, date, state, and comment.
      - Curator is the email address of the person who added the curation tag.
      - Date is the UTC time and date the curation tag was added, in YYYYMMDDHHMMSS format.
      - State can be: dark, approved, freeze, un-dark or blank.
      - Comment can be a code used by the scanning center team to indicate issues found during QA, or a text string with some other curation comment (e.g. information about why an item was frozen or made dark).
    Items uploaded into open collections are generally checked by malware detection software, and the curation field will contain the results of that check.
  • definition: Curation state and notes
  • required: No
  • label: Curation
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA admin
  • example: [curator]lenscriv@archive.org[/curator][date]20160504125613[/date][state]approved[/state][comment]199[/comment], [curator]malware@archive.org[/curator][date]20140321085621[/date][comment]checked for malware[/comment]
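
The bracketed sub-field syntax shown in the example can be split into a dictionary with one regular expression. This is a minimal sketch assuming sub-fields are never nested and each appears at most once; the function name `parse_curation` is hypothetical.

```python
import re

# Matches [name]value[/name] for the four documented sub-fields.
SUBFIELD_RE = re.compile(
    r"\[(curator|date|state|comment)\](.*?)\[/\1\]", re.DOTALL
)


def parse_curation(value: str) -> dict:
    """Split a curation string into its sub-fields."""
    return {name: text for name, text in SUBFIELD_RE.findall(value)}


example = (
    "[curator]lenscriv@archive.org[/curator][date]20160504125613[/date]"
    "[state]approved[/state][comment]199[/comment]"
)
print(parse_curation(example))
```

A sub-field that is absent from the string, such as state in the malware-check example, is simply missing from the returned dictionary.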

runtime

  • internal use only: No
  • usage notes: Uploader can set this field, but most often we have determined and set this value during the derive process.
  • definition: Length of an audio or video item
  • required: No
  • label: Run Time
  • repeatable: Yes
  • accepted values: HH:MM:SS, H:MM:SS, MM:SS, M:SS, 0:SS
  • edit access: uploader
  • defined by: uploader
  • example: 00:15:00, 2:12, 0:23
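
Because runtime may be written with or without an hours part, converting it to seconds makes values comparable. The sketch below handles the accepted colon-separated forms; the function name `runtime_to_seconds` is hypothetical, and it does not check that minutes or seconds stay below 60.

```python
def runtime_to_seconds(runtime: str) -> int:
    """Convert HH:MM:SS, H:MM:SS, MM:SS, M:SS or 0:SS to total seconds."""
    parts = runtime.strip().split(":")
    if len(parts) not in (2, 3) or not all(part.isdigit() for part in parts):
        raise ValueError(f"unrecognized runtime: {runtime!r}")
    seconds = 0
    # Each step shifts the running total up by one base-60 place.
    for part in parts:
        seconds = seconds * 60 + int(part)
    return seconds


for sample in ["00:15:00", "2:12", "0:23"]:
    print(sample, runtime_to_seconds(sample))
```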

publisher

  • internal use only: No
  • usage notes:
      - Books use publisher
      - Movies often use production company
      - Music often uses record label
  • definition: Publisher of the media
  • required: No
  • label: Publisher
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: New York : R.R. Bowker Co.

sound

  • internal use only: No
  • usage notes: Most used values are: sound, silent. Mostly used for video items, this field indicates whether the media has related sound or is silent.
  • definition: Indicates whether media has sound or is silent
  • required: No
  • label: Sound
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: sound, silent

color

  • internal use only: No
  • usage notes: Most used values are: color, B&W (black and white). Mostly used for video items, indicates whether video is color or black and white. Can be used to indicate different kinds of color (e.g. Kodachrome).
  • definition: Indicates whether media is in color or black and white
  • required: No
  • label: Color
  • repeatable: No
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: color

start_localtime

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Start time of program in broadcast time zone
  • required: No
  • label: Local Start Time
  • repeatable: No
  • accepted values: YYYY-MM-DD HH:MM:SS
  • edit access: IA admin
  • defined by: IA software
  • example: 2010-03-26 18:00:00

start_time

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Start time of program in UTC
  • required: No
  • label: UTC Start Time
  • repeatable: No
  • accepted values: YYYY-MM-DD HH:MM:SS
  • edit access: IA admin
  • defined by: IA software
  • example: 2010-03-26 15:00:00

stop_time

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Stop time of program in UTC
  • required: No
  • label: UTC Stop Time
  • repeatable: No
  • accepted values: YYYY-MM-DD HH:MM:SS
  • edit access: IA admin
  • defined by: IA software
  • example: 2010-03-26 16:00:00

utc_offset

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Offset between local time and UTC
  • required: No
  • label: UTC Offset
  • repeatable: No
  • accepted values: Whole number
  • edit access: IA admin
  • defined by: IA software
  • example: 300, -400

audio_codec

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Program used to decode audio stream
  • required: No
  • label: Audio Codec
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: ac3

audio_sample_rate

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Samples per second
  • required: No
  • label: Audio Sample Rate
  • repeatable: No
  • accepted values: Whole number
  • edit access: IA admin
  • defined by: IA software
  • example: 48000

video_codec

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Program used to decode video stream
  • required: No
  • label: Video Codec
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: mpeg2video

frames_per_second

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Frequency at which consecutive images are displayed
  • required: No
  • label: Frames Per Second
  • repeatable: No
  • accepted values: Number
  • edit access: IA admin
  • defined by: IA software
  • example: 29.97

source_pixel_width

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Pixel width of original video stream
  • required: No
  • label: Source Pixel Width
  • repeatable: No
  • accepted values: Whole number
  • edit access: IA admin
  • defined by: IA software
  • example: 704

source_pixel_height

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: Pixel height of original video stream
  • required: No
  • label: Source Pixel Height
  • repeatable: No
  • accepted values: Whole number
  • edit access: IA admin
  • defined by: IA software
  • example: 480

aspect_ratio

  • internal use only: No
  • usage notes: Standard values for this field are 4:3 and 16:9, but other values are possible.
  • definition: Ratio of the pixel width and height of a video stream
  • required: No
  • label: Aspect Ratio
  • repeatable: No
  • accepted values: #:#
  • edit access: IA admin
  • example: 4:3

closed_captioning

  • internal use only: No
  • usage notes: Field is generally only present when the video has closed captioning. When captioning is not present, the field may have “no” as the value, or just not be included in meta.xml
  • definition: Indicates whether item contains closed captioning files
  • required: No
  • label: Closed Captioning
  • repeatable: No
  • accepted values: yes, no
  • edit access: IA admin
  • example: yes, no

ccnum

  • internal use only: No
  • usage notes: Primarily used for TV Archive items. Closed captioning files are stored as [identifier].cc#.txt in the item. This tag indicates which cc# file to display in item and use for search indexing.
  • definition: Indicates which closed captioning file should be used for display and search
  • required: No
  • label: Closed Captioning Number
  • repeatable: No
  • accepted values: cc#, asr, ocr, #
  • edit access: IA admin
  • defined by: IA software
  • example: cc5

tuner

  • internal use only: No
  • usage notes: Primarily used for TV Archive items. Maps the program number as used in H.222 Program Association Tables and Program Mapping Tables to a channel number that can be entered via digits on a receiver’s remote control.
  • definition: Virtual Channel the video was recorded from
  • required: No
  • label: Tuner
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: Virtual Ch. 24

updater

  • internal use only: No
  • usage notes: After initial upload, when changes are made to the content of an item, the account that made the changes is recorded in meta.xml in an updater field. Updater fields are added to meta.xml in the order changes have been made, so the first listed updater belongs to the oldest modification.
  • definition: Screen name of the account that updated the item
  • required: No
  • label: Updater
  • repeatable: Yes
  • accepted values: String
  • edit access: not editable
  • defined by: IA software
  • example: tracey pooh

updatedate

  • internal use only: No
  • usage notes: Any time an item is changed via the editxml page by the updater (see the updater field), a corresponding updatedate field is added to the meta.xml. Updatedate fields are added to meta.xml in the order changes have been made, so the oldest dates are listed first.
  • definition: Date the item was updated by updater
  • required: No
  • label: Update Date
  • repeatable: Yes
  • accepted values: YYYY-MM-DD HH:MM:SS
  • edit access: not editable
  • defined by: IA software
  • example: 2009-03-02 21:48:28

updated

  • internal use only: No
  • usage notes: Timestamp is typically when the last task ran on the item, but metadata updates can also be triggered manually.
  • definition: Timestamp in the metadata table for the last time the item’s row in that table was written
  • required: No
  • label: Updated
  • repeatable: Yes
  • accepted values: YYYY-MM-DD
  • edit access: not editable
  • defined by: IA software
  • example: 2014-12-05

operator

  • internal use only: No
  • usage notes: Usually an email address. In texts this represents the person who operated the Scribe or other scanning equipment. In web items this represents the engineer responsible for the crawl.
  • definition: Email of the person who scanned/captured the media in the item
  • required: No
  • label: Operator
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: associate-stephanie-kinsey@archive.org

foldoutcount

  • internal use only: No
  • usage notes: Fold outs are photographed on machinery other than the Scribe. This field indicates how many foldouts were captured. The value may be 0 or higher.
  • definition: Number of fold outs captured by operator
  • required: No
  • label: Fold Out Count
  • repeatable: No
  • accepted values: Whole number
  • edit access: IA admin
  • defined by: IA software
  • example: 1

external-identifier

  • internal use only: No
  • usage notes: External-identifier includes Uniform Resource Names (URNs) for external resources about the media in the item. The field is usually in the form of urn:namespace:identifier.
  • definition: URLs or identifiers to outside resources that represent the media
  • required: No
  • label: External Identifier
  • repeatable: Yes
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: urn:publisher_catalog_id:88697 03614 2, urn:pubcat:victor:18890-B, urn:spotify:album:3jmETApVCjXb3hWTR1IEdH, urn:asin:0451531396, acs:epub:urn:uuid:d935586b-72a7-4720-bbb9-72fe75eae0e1, urn:acs6:blackreconstruc00dubo:epub:38413c16-074b-4fb6-a4dc-25e93e199d5f, urn:mb_artist_id:6de0f914-3e60-4418-be3b-42e0feb6eb4d, urn:X-pwacrawlid:AWP5
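
Values that follow the usual urn:namespace:identifier form can be split into their namespace and identifier parts. This is a minimal sketch with a hypothetical function name; values that do not start with urn, such as the acs:epub example above, return None instead of being guessed at.

```python
from typing import Optional, Tuple


def split_urn(value: str) -> Optional[Tuple[str, str]]:
    """Split 'urn:namespace:identifier' into (namespace, identifier)."""
    scheme, _, rest = value.partition(":")
    # Only the first colon after the namespace separates it from the
    # identifier, so identifiers may themselves contain colons.
    namespace, separator, identifier = rest.partition(":")
    if scheme.lower() != "urn" or not separator or not namespace or not identifier:
        return None
    return namespace, identifier


print(split_urn("urn:asin:0451531396"))
print(split_urn("urn:spotify:album:3jmETApVCjXb3hWTR1IEdH"))
```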

page-progression

  • internal use only: No
  • usage notes: lr = left to right; rl = right to left
  • definition: Determines direction pages will be “turned” in a book
  • required: No
  • label: Page Progression
  • repeatable: No
  • accepted values: lr, rl
  • edit access: uploader
  • defined by: uploader
  • example: rl, lr

previous_item

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: IA identifier of previous item from a recorded feed
  • required: No
  • label: Previous Item
  • repeatable: No
  • accepted values: identifier
  • edit access: IA admin
  • defined by: IA software
  • example: BBCNEWS_20121204_060000_Breakfast

next_item

  • internal use only: No
  • usage notes: Primarily used for TV Archive items.
  • definition: IA identifier of next item from a recorded feed
  • required: No
  • label: Next Item
  • repeatable: No
  • accepted values: identifier
  • edit access: IA admin
  • defined by: IA software
  • example: BBCNEWS_20121204_090000_BBC_News

licenseurl

  • internal use only: No
  • usage notes: This link should point to a recognized license, like Creative Commons or GNU. For other types of rights statements, use the rights field.
  • definition: URL of the selected license
  • required: No
  • label: License URL
  • repeatable: No
  • accepted values: URL
  • edit access: uploader
  • defined by: uploader
  • example: http://creativecommons.org/licenses/by-nd/3.0/

sponsordate

  • internal use only: Yes
  • usage notes: Related to digitization work. Usually a date string “YYYYMMDD”, but can contain notes, such as: “not to be invoiced-past billing period”, “Grant ended, item not yet invoiced”, “sent20111010”
  • definition: Billing date for scanned materials
  • required: No
  • label: Sponsor Date
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA admin
  • example: 20100531
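Since sponsordate is usually a YYYYMMDD string but may instead hold a free-text note, code that reads it has to tolerate both. A minimal Python sketch of that handling follows; parse_sponsordate is a hypothetical helper name, and returning None for notes is an assumption about what a caller would want.

```python
from datetime import date, datetime
from typing import Optional


def parse_sponsordate(value: str) -> Optional[date]:
    """Return a date for 'YYYYMMDD' values, or None for free-text notes."""
    text = value.strip()
    if len(text) == 8 and text.isdigit():
        try:
            return datetime.strptime(text, "%Y%m%d").date()
        except ValueError:
            # Eight digits that do not form a real calendar date.
            return None
    return None
```

With this sketch, "20100531" parses to 31 May 2010, while a note such as "sent20111010" yields None.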

boxid

  • internal use only: Yes
  • usage notes: Boxids always start with the letters IA followed by numbers. The numbers represent the container, pallet and box that the physical item is stored in. When there are multiple boxid fields in meta.xml, the first boxid listed represents the physical item that was digitized. Subsequent boxid fields represent the location of duplicate physical items.
  • definition: Location of physical item in the Physical Archive
  • required: No
  • label: Box ID
  • repeatable: Yes
  • accepted values: IA######
  • edit access: IA admin
  • defined by: IA admin
  • example: IA158001
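The boxid rules above can be sketched in Python as follows. The helper names are hypothetical. The pattern accepts "IA" followed by one or more digits, which is an assumption based on the usage notes; the accepted-values entry (IA######) may imply a fixed digit count.

```python
import re
from typing import List, Optional

# Assumption: "IA" followed by one or more digits, per the usage notes.
BOXID_RE = re.compile(r"IA\d+")


def is_boxid(value: str) -> bool:
    """Check that a value looks like a boxid (IA followed by numbers)."""
    return BOXID_RE.fullmatch(value) is not None


def digitized_boxid(boxids: List[str]) -> Optional[str]:
    """Return the first boxid, which represents the digitized physical item."""
    return boxids[0] if boxids else None
```

Any boxid values after the first would then be treated as locations of duplicate physical items.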

bookreader-defaults

  • internal use only: No
  • usage notes: The bookreader defaults to showing books in 2up mode, so this field is generally only used to indicate that an item should be displayed in 1up mode (showing only one page at a time in the bookreader).
  • definition: Indicates whether the bookreader should display one or two pages by default
  • required: No
  • label: Bookreader defaults
  • repeatable: No
  • accepted values: mode/1up, mode/2up
  • edit access: uploader
  • defined by: uploader
  • example: mode/1up

betterpdf

  • internal use only: No
  • usage notes: This field is either set to the value true, or is not included in meta.xml. If this field is added after the initial derive has run, the user should also run a derive task to create the better quality PDF.
  • definition: Indicates that the derive module should create a higher quality PDF derivative (distinguishes text from background better).
  • required: No
  • label: Better PDF
  • repeatable: No
  • accepted values: true
  • edit access: uploader
  • defined by: uploader
  • example: true

republisher

  • internal use only: Yes
  • usage notes: This field is deprecated.
  • definition: Deprecated. Email of the person who completed republishing the item
  • required: Deprecated
  • label: Republisher
  • repeatable: No
  • accepted values: email address
  • edit access: IA software
  • defined by: IA software
  • example: associate-kiana-fekette@archive.org

republisher_operator

  • internal use only: Yes
  • usage notes: Set by Scribe3 software.
  • definition: Email of the person who completed republishing the item
  • required: No
  • label: Republisher Operator
  • repeatable: No
  • accepted values: email address
  • edit access: IA software
  • defined by: IA software
  • example: associate-kiana-fekette@archive.org

republisher_date

  • internal use only: Yes
  • usage notes: Set by Scribe3 software.
  • definition: Date and time in UTC that the item was created on archive.org
  • required: No
  • label: Republisher Date
  • repeatable: No
  • accepted values: YYYYMMDDHHMMSS
  • edit access: IA software
  • defined by: IA software
  • example: 20170801165730
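A republisher_date value in the YYYYMMDDHHMMSS form can be turned into a timezone-aware datetime with the standard library. This is a minimal sketch; the function name parse_republisher_date is hypothetical.

```python
from datetime import datetime, timezone


def parse_republisher_date(value: str) -> datetime:
    """Parse a 'YYYYMMDDHHMMSS' string and mark it as UTC."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)
```

Applied to the example value 20170801165730, this gives 1 August 2017, 16:57:30 UTC.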

republisher_time

  • internal use only: Yes
  • usage notes: Set by Scribe3 software.
  • definition: Number of seconds required to republish text
  • required: No
  • label: Republisher Time
  • repeatable: No
  • accepted values: whole number
  • edit access: IA software
  • defined by: IA software
  • example: 504

camera

  • internal use only: No
  • definition: Camera model used during digitization process
  • required: No
  • label: Camera
  • repeatable: No
  • accepted values: String
  • edit access: IA admin
  • defined by: IA software
  • example: Canon 5D

oclc-id

  • internal use only: No
  • definition: Identifier of same edition in OCLC records
  • required: No
  • label: OCLC ID
  • repeatable: Yes
  • accepted values: String
  • edit access: uploader
  • defined by: uploader
  • example: 37432884

hidden

  • internal use only: Yes
  • usage notes: This tag only functions on items of mediatype collection.
  • definition: Hides collection from top level navigation
  • required: No
  • label: Hidden
  • repeatable: No
  • accepted values: true
  • edit access: IA admin
  • defined by: IA admin
  • example: true

identifier-ark

  • internal use only: No
  • usage notes: ARKs are URLs designed to support long-term access to information objects. We store the ark:/NAAN/Name portion of the URL in meta.xml. This can be appended to any ARK resolver’s domain (e.g. http://n2t.net/) to resolve the ARK. Read about ARKs at http://n2t.net/e/ark_ids.html; the ARK specification is at http://n2t.net/e/arkspec.txt
  • definition: Archival Resource Key identifier
  • required: No
  • label: ARK
  • repeatable: No
  • accepted values: ark:/NAAN/Name
  • edit access: uploader
  • defined by: uploader
  • example: ark:/13960/t4rj5fk7h
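To illustrate how the stored ark:/NAAN/Name value combines with a resolver's domain, here is a small Python sketch. The function name ark_url is hypothetical, and the default resolver is simply the n2t.net address mentioned in the usage notes.

```python
def ark_url(ark: str, resolver: str = "http://n2t.net/") -> str:
    """Join a resolver base URL with an 'ark:/NAAN/Name' value."""
    if not ark.startswith("ark:/"):
        raise ValueError(f"expected a value of the form ark:/NAAN/Name, got {ark!r}")
    # Normalize so exactly one slash separates the resolver and the ARK.
    return resolver.rstrip("/") + "/" + ark
```

For the example value above, this produces http://n2t.net/ark:/13960/t4rj5fk7h.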

openlibrary

  • internal use only: No
  • usage notes: This field is deprecated. Please use openlibrary_edition.
  • definition: Deprecated. Open Library edition identifier
  • required: Deprecated
  • label: Open Library Identifier
  • repeatable: No
  • accepted values: OL#M
  • edit access: uploader
  • defined by: uploader
  • example: OL2769393M

openlibrary_edition

  • internal use only: No
  • usage notes: Correlates to the edition page on openlibrary.org. The OL edition page URL is https://openlibrary.org/books/[openlibrary_edition]
  • definition: Open Library edition identifier
  • required: No
  • label: Open Library edition identifier
  • repeatable: No
  • accepted values: OL#M
  • edit access: uploader
  • defined by: uploader
  • example: OL2769393M

openlibrary_work

  • internal use only: No
  • usage notes: Correlates to the work page on openlibrary.org. The OL work page URL is https://openlibrary.org/works/[openlibrary_work]
  • definition: Open Library work identifier
  • required: No
  • label: Open Library work identifier
  • repeatable: No
  • accepted values: OL#W
  • edit access: uploader
  • defined by: uploader
  • example: OL675783W
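The Open Library edition and work identifiers map onto page URLs as described in their usage notes. The Python sketch below builds those URLs and checks the identifier against the OL#M and OL#W shapes. The function name openlibrary_url and the "edition"/"work" keys are hypothetical, and reading "#" as one or more digits is an assumption.

```python
import re

# Assumption: "#" in OL#M / OL#W stands for one or more digits.
_KINDS = {
    "edition": (re.compile(r"OL\d+M"), "https://openlibrary.org/books/"),
    "work": (re.compile(r"OL\d+W"), "https://openlibrary.org/works/"),
}


def openlibrary_url(kind: str, olid: str) -> str:
    """Build the openlibrary.org page URL for an edition or work identifier."""
    pattern, base = _KINDS[kind]
    if not pattern.fullmatch(olid):
        raise ValueError(f"{olid!r} is not a valid {kind} identifier")
    return base + olid
```

For instance, openlibrary_url("edition", "OL2769393M") returns https://openlibrary.org/books/OL2769393M.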

openlibrary_subject

  • internal use only: No
  • usage notes: This field is currently used to supply books for carousels on the openlibrary.org home page. At some point it will also be used to import subjects from the openlibrary_work associated with the item.
  • definition: Open Library subject
  • required: No
  • label: Open Library subject
  • repeatable: Yes
  • accepted values: string
  • edit access: uploader
  • defined by: uploader
  • example: openlibrary_staff_picks

openlibrary_author

  • internal use only: No
  • usage notes: Correlates to the author page on openlibrary.org. The OL author page URL is https://openlibrary.org/authors/[openlibrary_author]
  • definition: Open Library author
  • required: No
  • label: Open Library author
  • repeatable: Yes
  • accepted values: OL#A
  • edit access: uploader
  • defined by: uploader
  • example: OL52922A

volume

  • internal use only: No
  • usage notes: This field is not overwritten by MARC
  • definition: Volume number or name
  • required: No
  • label: Volume
  • repeatable: No
  • accepted values: string
  • edit access: uploader
  • defined by: uploader
  • example: 15

call_number

  • internal use only: No
  • definition: Contributing library’s local call number
  • required: No
  • label: Call Number
  • repeatable: No
  • accepted values: string
  • edit access: uploader
  • defined by: uploader
  • example: 6675707, NC 285.1 P9287m

scanfee

  • internal use only: Yes
  • usage notes: Set by software based on parameters for each scanning partner
  • definition: Scanning fee used during billing process
  • required: No
  • label: Scan Fee
  • repeatable: No
  • accepted values: string
  • edit access: IA admin
  • defined by: IA software
  • example: 100, 300;10;200, 0;1.45;0

lccn

  • internal use only: No
  • usage notes: https://www.loc.gov/marc/lccn_structure.html
  • definition: Library of Congress Control Number
  • required: No
  • label: LCCN
  • repeatable: Yes
  • accepted values: Whole number
  • edit access: uploader
  • defined by: uploader
  • example: 2004045278

isbn

  • internal use only: No
  • usage notes: https://www.iso.org/standard/65483.html https://www.isbn.org/faqs_general_questions#isbn_faq5
  • definition: ISBN-10 or ISBN-13
  • required: No
  • label: ISBN
  • repeatable: Yes
  • accepted values: string of 10 or 13 digits. Final digit can be [0-9] or ‘X’
  • edit access: uploader
  • defined by: uploader
  • example: 3540212507, 031294716X
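Beyond the length and character rules listed above, ISBNs carry a check digit. The Python sketch below validates both forms using the standard ISBN check-digit arithmetic, which comes from the ISBN standard rather than from this document; archive.org is not known to enforce it. The function name is_valid_isbn is hypothetical, and the sketch expects a bare string with no hyphens or spaces.

```python
def is_valid_isbn(value: str) -> bool:
    """Validate an ISBN-10 or ISBN-13 string, including its check digit."""
    s = value.strip().upper()
    if not s.isascii():
        return False
    if len(s) == 10:
        # First nine characters are digits; the last may be a digit or 'X' (10).
        if not (s[:9].isdigit() and (s[9].isdigit() or s[9] == "X")):
            return False
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s)
        )
        return total % 11 == 0
    if len(s) == 13 and s.isdigit():
        # Weights alternate 1, 3, 1, 3, ... across all thirteen digits.
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
        return total % 10 == 0
    return False
```

Both example values above (3540212507 and 031294716X) pass this check.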

viruscheck

  • internal use only: Yes
  • usage notes: This tag only functions on items of mediatype collection. The tag is either present with a value of “true” or it should not be present in the item metadata at all. Currently all items uploaded into the open community collections have the virus check task run on them, without needing this tag. Any other collection that needs virus checking should have this tag present in order to trigger the virus check task to run on items uploaded into the collection.
  • definition: Causes virus check task to run on any item added to the collection
  • required: No
  • label: Virus Check
  • repeatable: No
  • accepted values: true
  • edit access: IA admin
  • defined by: IA admin

lastfiledate

  • internal use only: No
  • repeatable: No
  • required: No

firstfiledate

  • internal use only: No
  • repeatable: No
  • required: No

condition

  • definition: condition of media
  • usage notes: Defines the condition of the media in an item. In 78s and LPs this indicates the condition of the disc.
  • required: No
  • label: Condition
  • repeatable: No
  • accepted values: Near Mint, Very Good, Good, Fair, Worn, Poor, Fragile

condition-visual

  • definition: condition of the artwork or printed materials that accompany a media item
  • usage notes: Defines the condition of the artwork or printed materials that accompany the media in an item. In LPs this is used for album covers and sleeves.
  • required: No
  • label: Visual Condition
  • repeatable: No
  • accepted values: Near Mint, Very Good, Good, Fair, Worn, Poor, Fragile

File Metadata Schema

Below are _files.xml fields that have special meaning on archive.org.

name

  • defined by: uploader
  • usage notes: Path to file name
  • required: yes
  • label: File Name
  • repeatable: no
  • accepted values: string
  • example: WHPT_102_5_FM_20180717_110000.mp3, noti080130.thumbs/noti080130_001134.jpg

source

  • defined by: system
  • usage notes: Indicates whether the file was uploaded by a user (original) or generated by the system (derivative or metadata)
  • required: yes
  • label: File Source
  • repeatable: no
  • accepted values: original, derivative, metadata
  • example: original

format

  • defined by: system
  • usage notes: Indicates the type (format) of file. See tab “File Formats” for examples
  • required: yes
  • label: File Format
  • repeatable: no
  • accepted values: string
  • example: Thumbnail

md5

  • defined by: system
  • usage notes: Cryptographic hash used to verify contents of file.
  • required: yes
  • label: Message Digest 5 checksum
  • repeatable: no
  • accepted values: 32-length hex digest
  • example: 5376a4d222f6b2938f79a5843224cd42

size

  • defined by: system
  • usage notes: File size in bytes
  • required: yes
  • label: File Size
  • repeatable: no
  • accepted values: integer
  • example: 31985916

mtime

  • defined by: system
  • usage notes: Unix timestamp indicating when file was last modified.
  • required: yes
  • label: modified time
  • repeatable: no
  • accepted values: integer
  • example: 1531829145
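Because mtime is a Unix timestamp, it can be converted to a readable UTC datetime with the standard library. A minimal sketch, with mtime_to_utc as a hypothetical helper name; it accepts either an integer or its string form, since values read from XML arrive as text.

```python
from datetime import datetime, timezone
from typing import Union


def mtime_to_utc(mtime: Union[int, str]) -> datetime:
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to a datetime."""
    return datetime.fromtimestamp(int(mtime), tz=timezone.utc)
```

The example value 1531829145 corresponds to 2018-07-17 12:05:45 UTC.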

sha1

  • defined by: system
  • usage notes: Cryptographic hash used to verify contents of file.
  • required: yes
  • label: Secure Hash Algorithm 1
  • repeatable: no
  • accepted values: 40-length hex digest
  • example: eec164bd3fca541beb3bb092826e8b47a8c91385

crc32

  • defined by: system
  • usage notes: Error-detecting code used to verify contents of file.
  • required: yes
  • label: Cyclic Redundancy Check
  • repeatable: no
  • accepted values: 8-length hex digest
  • example: 0ed6ddef
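To verify a downloaded file against the md5, sha1, and crc32 values recorded for it, the three digests can be computed locally in a single pass. The Python sketch below uses hashlib and zlib. The function name file_checksums is hypothetical, and it assumes the crc32 field holds the common CRC-32 (as implemented by zlib) written as eight lowercase hex digits.

```python
import hashlib
import zlib
from typing import Dict


def file_checksums(path: str, chunk_size: int = 1 << 20) -> Dict[str, str]:
    """Compute md5, sha1, and crc32 hex digests of a file in one pass."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)
    return {
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
    }
```

Comparing the returned values with the corresponding entries for the file would then indicate whether the local copy is intact.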

original

  • defined by: system
  • usage notes: For derivative files, this indicates the original file the derive was performed upon.
  • required: no
  • label: Original File
  • repeatable: no
  • accepted values: string
  • example: WHPT_102_5_FM_20180717_110000.mp3

private

  • defined by: system
  • usage notes: Indicates that access to this file is restricted to users with appropriate permissions
  • required: no
  • label: Restricted Access
  • repeatable: no
  • accepted values: true
  • example: true

length

  • defined by: uploader or system
  • usage notes: Run length (in seconds).
  • required: no
  • label: Run Length
  • repeatable: no
  • example: 274.97, 04:35
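The two example forms of length (plain seconds and a colon-separated value) can be normalized to seconds as sketched below. The function name length_to_seconds is hypothetical, and reading colon-separated values as [HH:]MM:SS is an assumption drawn from the 04:35 example.

```python
def length_to_seconds(value: str) -> float:
    """Convert '274.97' or '[HH:]MM:SS' style run lengths to seconds."""
    seconds = 0.0
    # Each colon-separated part is worth 60 times the part to its right.
    for part in value.strip().split(":"):
        seconds = seconds * 60 + float(part)
    return seconds
```

Under that assumption, "04:35" becomes 275.0 seconds and "274.97" is returned unchanged as a float.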

width

  • defined by: uploader or system
  • usage notes: Image width (in pixels)
  • required: no
  • label: Image Width
  • repeatable: no
  • accepted values: integer
  • example: 800

height

  • defined by: uploader or system
  • usage notes: Image height (in pixels)
  • required: no
  • label: Image Height
  • repeatable: no
  • accepted values: integer
  • example: 600

title

  • defined by: uploader
  • usage notes: Song or track title in audio files.
  • required: no
  • label: Track Title
  • repeatable: no
  • accepted values: string

rotation

  • defined by: uploader
  • usage notes: Represents the rotation (in degrees) of the camera. Typical values are 0, 90, 180, and 270.
  • required: no
  • label: Camera Rotation
  • repeatable: no
  • accepted values: integer

track

  • defined by: uploader
  • usage notes: Track number on the album. Usually an integer, but occasionally has values like “2/14” (track 2 of 14)
  • required: no
  • label: Track Number
  • repeatable: no
  • accepted values: integer or string
  • example: 1
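Since track is usually an integer but occasionally takes the "2/14" form, a reader may want to separate the track number from the optional total. A minimal Python sketch follows; parse_track is a hypothetical helper, and other non-numeric values will raise ValueError rather than being guessed at.

```python
from typing import Optional, Tuple, Union


def parse_track(value: Union[int, str]) -> Tuple[int, Optional[int]]:
    """Return (track_number, total_tracks); total is None when absent."""
    number, _, total = str(value).strip().partition("/")
    return int(number), (int(total) if total else None)
```

Here "2/14" is read as track 2 of 14, and a plain "1" as track 1 with no known total.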

album

  • defined by: uploader
  • usage notes: Name of album for audio items
  • required: no
  • label: Album Name
  • repeatable: no
  • accepted values: string

btih

  • defined by: system
  • required: no
  • label: BitTorrent Info Hash
  • repeatable: no
  • accepted values: string

bitrate

  • repeatable: no
  • required: no
  • label: Bit Rate

creator

  • repeatable: no
  • required: no
  • example: Grateful Dead
  • label: Media Creator

artist

  • repeatable: no
  • required: no
  • usage notes: Similar to “creator”. This is a common tag used in audio files.
  • label: Media Artist

genre

  • repeatable: no
  • required: no
  • label: Music Genre

external-identifier

  • repeatable: no
  • required: no
  • label: External Identifier

comment

  • repeatable: no
  • required: no

original-name

  • repeatable: no
  • required: no

ctime

  • repeatable: no
  • required: no

atime

  • repeatable: no
  • required: no

wb_filtered

  • repeatable: no
  • required: no

publisher

  • repeatable: no
  • required: no

matrix_number

  • repeatable: no
  • required: no

collection-catalog-number

  • repeatable: no
  • required: no

operatingsystem

  • repeatable: no
  • required: no

autoplay

  • repeatable: no
  • required: no

external-identifier-match-date

  • defined by: uploader
  • repeatable: yes
  • required: no
  • example: youtube:2018-07-19T04:36:11Z
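The single example above suggests a value made of a source name, a colon, and a UTC timestamp. The Python sketch below parses that shape. It is an assumption based only on this one example, and the function name parse_match_date is hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple


def parse_match_date(value: str) -> Tuple[str, datetime]:
    """Split 'source:YYYY-MM-DDTHH:MM:SSZ' into (source, UTC datetime)."""
    # Split on the first colon only; the timestamp itself contains colons.
    source, sep, stamp = value.partition(":")
    if not source or not sep:
        raise ValueError(f"expected 'source:timestamp', got {value!r}")
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return source, when.replace(tzinfo=timezone.utc)
```

For the example value, this gives the source "youtube" and a timestamp of 2018-07-19 04:36:11 UTC.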